Unsupervised Structured Data Extraction from Template-generated Web Pages
Abstract
This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This potential encourages industry and academia to seek automatic solutions. To tackle the problem of automatic structured Web data extraction, we present a new approach: structured data extraction based on clustering visually similar Web page elements. Our method, called ClustVX, combines visual and pure HTML features of a Web page to cluster visually similar Web page elements and then extract structured Web data. ClustVX can extract structured data from Web pages where more than one data record is present. With extensive experimental evaluation on three benchmark datasets, we demonstrate that ClustVX achieves better results than other state-of-the-art automatic structured Web data extraction methods.
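The abstract describes the idea only at a high level, so the following is a minimal sketch of what "clustering visually similar Web page elements" could look like, not the ClustVX algorithm itself. It assumes each element has already been reduced to a dictionary with hypothetical fields `xpath`, `font`, `color`, and `text`; the helper names `strip_indices` and `cluster_elements` are also illustrative. Elements are grouped by a key that combines an HTML feature (the XPath without positional indices) with visual features (font and color).

```python
import re
from collections import defaultdict


def strip_indices(xpath):
    """Drop positional indices such as [2] so repeated records share one path."""
    return re.sub(r"\[\d+\]", "", xpath)


def cluster_elements(elements):
    """Group elements whose structural path and visual style are identical.

    `elements` is a list of dicts with the hypothetical keys
    'xpath', 'font', 'color', and 'text' (an assumption of this sketch).
    """
    clusters = defaultdict(list)
    for el in elements:
        key = (strip_indices(el["xpath"]), el["font"], el["color"])
        clusters[key].append(el["text"])
    return dict(clusters)


# Two product records rendered by the same (made-up) template.
elements = [
    {"xpath": "/html/body/div[1]/h2", "font": "bold 16px", "color": "#000", "text": "Camera A"},
    {"xpath": "/html/body/div[1]/span", "font": "12px", "color": "#c00", "text": "$199"},
    {"xpath": "/html/body/div[2]/h2", "font": "bold 16px", "color": "#000", "text": "Camera B"},
    {"xpath": "/html/body/div[2]/span", "font": "12px", "color": "#c00", "text": "$249"},
]

clusters = cluster_elements(elements)
for key, texts in clusters.items():
    print(key, texts)
```

In this toy input the two titles fall into one cluster and the two prices into another, so each cluster corresponds to one candidate data field across records. A real system would obtain the visual features from a rendered page rather than from hand-written dictionaries.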
Similar resources
Using XPaths of inbound links to cluster template-generated web pages
Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers that must be matched only to pages of a particular template. Selecting a single type of template from all crawled Web pages is a time consum...
RoadRunner for Heterogeneous Web Pages Using Extended MinHash
The Internet presents a large amount of useful information that is usually formatted for human users, which makes it hard to extract relevant data from diverse sources. Therefore, there is a significant need for robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database. IE produces structure...
Automatic Record Extraction for the World Wide Web
As the amount of information on the World Wide Web grows, there is an increasing demand for software that can automatically process and extract information from web pages. Despite the fact that the underlying data on most web pages is structured, we cannot automatically process these web sites/pages as structured data. We need robust technologies that can automatically understand human-readable...
Page-Level Data Extraction Approach for Web Pages Using Data Mining Techniques
Web data extraction has been an important part of many Web data analysis applications. In this paper, we formulate the data extraction problem as the decoding process of page generation based on structured data and tree templates[1]. We propose an unsupervised, page-level data extraction approach to deduce the schema and templates for each individual Deep Website, which contains either singleton or m...
Extracting Structured Data from Web Pages (Poster)
Many web sites contain a large collection of “structured” web pages. These pages encode data from an underlying structured source, and are typically generated dynamically. An example of such a collection is the set of book pages in Amazon. There are two important characteristics of such a collection: first, all the pages in the collection contain structured data conforming to a common schema; s...
Journal title:
- J. UCS
Volume 20, issue
Pages -
Publication date 2014